Fernandez et al. BMC Genomics 201 1, 12:604 
http://www.biomedcentral.eom/1 471 -21 64/1 2/604 



RESEARCH ARTICLE Open Access 



Subfunctionalization reduces the fitness cost of 
gene duplication in humans by buffering dosage 
imbalances 

Ariel Fernandez 1,2,3 *, Yun-Huei Tzeng 4 and Sze-Bi Hsu 5 



Abstract 

Background: Driven essentially by random genetic drift, subfunctionalization has been identified as a possible 
non-adaptive mechanism for the retention of duplicate genes in small-population species, where widespread 
deleterious mutations are likely to cause complementary loss of subfunctions across gene copies. Through 
subfunctionalization, duplicates become indispensable to maintain the functional requirements of the ancestral 
locus. Yet, gene duplication produces a dosage imbalance in the encoded proteins and thus, as investigated in this 
paper, subfunctionalization must be subject to the selective forces arising from the fitness bottleneck introduced 
by the duplication event. 

Results: We show that, while arising from random drift, subfunctionalization must be inescapably subject to 
selective forces, since the diversification of expression patterns across paralogs mitigates duplication-related dosage 
imbalances in the concentrations of encoded proteins. Dosage imbalance effects become paramount when 
proteins rely on obligatory associations to maintain their structural integrity, and are expected to be weaker when 
protein complexation is ephemeral or adventitious. To establish the buffering effect of subfunctionalization on 
selection pressure, we determine the packing quality of encoded proteins, an established indicator of dosage 
sensitivity, and correlate this parameter with the extent of paralog segregation in humans, using species with larger 
population -and more efficient selection- as controls. 

Conclusions: Recognizing the role of subfunctionalization as a dosage-imbalance buffer in gene duplication events 
enabled us to reconcile its mechanistic nonadaptive origin with its adaptive role as an enabler of the evolution of 
genetic redundancy. This constructive role was established in this paper by proving the following assertion: If 
subfunctionolizotion is indeed adoptive, its effect on porolog segregation should scale with the dosage sensitivity of the 
duplicated genes. Thus, subfunctionalization becomes adaptive in response to the selection forces arising from the 
fitness bottleneck imposed by gene duplication. 




Genomics 



Background 

A shift in understanding the evolutionary forces that 
shape the human genome architecture took place when 
the retention of duplicate genes, a major factor in fos- 
tering genome complexity, was recognized to be primar- 
ily promoted by random genetic drift [1,2]. Thus, the 
evolution of genetic redundancy in human and in other 
higher eukaryotes is enabled by subfunctionalization, a 
preservation process driven by mildly degenerative 
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mutations that cause complementary loss of subfunc- 
tions in different gene copies. These typically dissimilar 
effects promote the separation of duplicates across cell 
types or developmental phases, thus making them indis- 
pensable to maintain the functional requirements of the 
ancestral locus. In subfunctionalization, expression-regu- 
latory elements are essentially lost through complemen- 
tary loss-of-function mutations in paralogs, leading to a 
partitioning of the function across tissues or develop- 
mental phases. Thus, this nonadaptive mechanism is 
essentially constructive [3], and is enabled by selection 
inefficiency, which is expected given the small size of 
the human population [1,3,4]. 
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Yet, as shown in this work, the retention of gene 
duplicates through subfunctionalization must also 
encompass adaptive elements. This is so because dosage 
imbalances arise in the concentrations of the encoded 
proteins as a result of gene duplication events and the 
deleterious effects of such imbalances can be mitigated 
when paralogs are physically separated by subfunctiona- 
lization. Dosage imbalances occur when protein concen- 
tration levels at specific tissue locations do not fit the 
stoichiometry of the complexes in which the proteins 
are involved [5,6]. The complexes may be transient or 
obligatory with regards to maintaining the structural 
integrity of the protein. Therefore, dosage sensitivity, 
that is, the fitness impact of dosage imbalance, must be 
determined by the extent of functional reliance of the 
protein on associations [7]. 

In this work we hypothesize that duplication of 
dosage-sensitive genes imposes a selection pressure on 
the fate of the duplicates that is buffered through sub- 
functionalization. Thus, although originated in random 
drift, subfunctionalization cannot, and in effect does not, 
escape the selection forces but rather becomes adaptive 
to mitigate the fitness bottleneck imposed by the gene 
duplication event. To validate this hypothesis, we iden- 
tify a molecular attribute of proteins that is indicative of 
their dosage sensitivity, thereby quantifying the impact 
of dosage imbalance effects on the evolution of genetic 
redundancy. Thus, this work is devoted to prove the fol- 
lowing assertion: If subfunctionalization is indeed adap- 
tive, its effect on paralog segregation should scale with 
the dosage sensitivity of the duplicated genes. As shown 
in this work, this is indeed the case, and in this way, the 
adaptive nature of subfunctionalization is shown to arise 
from the imbalance-buffering nature of the process. 

Since unicellular organisms lack the buffer of expres- 
sion diversification, selection pressure on duplicate 
genes is frequently enough to eliminate one of the 
duplicates, especially for genes with high dosage sensi- 
tivity. Proof of this is the significant decrease in family 
size with dosage sensitivity encountered in unicellular 
eukaryotes when compared with higher eukaryotes [7]. 
Thus, gene duplicates in unicellular organisms are sub- 
ject to higher purifying selection than their counterparts 
in multicellular eukaryotes. The scope of this work is to 
show that subfunctionalization is one of the buffering 
mechanisms that enable paralog survival in multicellular 
eukaryotes. 

To assess the adaptive contribution to subfunctionali- 
zation, it becomes essential to introduce a molecular 
indicator of dosage sensitivity. As shown in previous 
work [7], dosage imbalance effects are quantified by 
under-wrapping (v), a measure of the packing quality of 
soluble gene products that determines the extent of reli- 
ance of the protein on binding partnerships to maintain 



its structural integrity [8-12]. Specifically, v defines in a 
structure-averaged way the level of hindrance of struc- 
ture-disruptive backbone hydration. This parameter can 
be determined directly from protein structure by identi- 
fying the percentage of backbone hydrogen bonds 
(BHBs) that are unburied -the so-called dehydrons- and 
hence poorly protected from competing hydration of the 
amide and carbonyl [9]. Dehydrons constitute packing 
deficiencies since they are incompletely "wrapped" by 
the side-chain nonpolar groups that promote exclusion 
of surrounding water. Thus, for an individual gene, we 
get v = (#dehydrons)/(#BHBs) where the quotient 
extends over all gene products or encoded proteins. 
Dehydrons are markers of compulsory protein associa- 
tions that play a structure-protective role by promoting 
their inter-molecular dehydration [9-12]. Upon protein- 
protein association, the side-chain nonpolar groups of 
the binding partner penetrate the microenvironment of 
the dehydron, contributing to improve its wrapping [12]. 
This dehydration stabilizes the hydrogen bond in -3.9 
kj/mol [10]. 

In practice, given the dearth of structurally reported 
structures when compared with proteome size, dehy- 
drons are often identified from protein sequence using 
machine-learning methods of inference (Materials and 
Methods). The rationale for this approach is that, being 
local indicators of structural disruption, dehydrons 
belong to a twilight zone between order and disorder 
that can be identified using a reliable sequence-based 
predictor of disorder propensity such as PONDR [13]. 

Recent cross-examination of structural and evolution- 
ary data revealed that duplicates of genes encoding for 
under-wrapped proteins are exposed to higher deleter- 
ious pressure than gene duplicates coding for well- 
wrapped products. Thus, v serves as a proxy for dosage 
sensitivity, as confirmed by a statistically significant 
negative correlation between family-averaged v (<v>) 
and family size [7]. 

Paralog survival is dependent on v with P < 10" 17 in 
unicellular organisms, P < 10" 6 in fly and worm, but P < 
6.7 x 10" 3 in human (Wilcoxon rank test) [7]. This con- 
trast between simple and complex organisms is likely to 
arise due to the higher complexity of expression regula- 
tion in higher eukaryotes. The translation complexity 
may enable a buffer to dosage imbalance not likely to be 
found in unicellular organisms. By focusing on evolu- 
tion-related dosage imbalances, our results corroborate 
this hypothesis. 

The validation of the results asserting the adaptive 
component of subfunctionalization rests squarely on the 
legitimacy of under-wrapping as a proxy for dosage sen- 
sitivity. Evidence inversely correlating gene family size 
and under-wrapping [7], evidence arising from analysis 
of the mechanisms that buffer dosage imbalances in 
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humans [8], and evidence on the reliance of under- 
wrapped proteins on binding partnerships to maintain 
their structural integrity [12], all uphold the validity of 
under-wrapping as a molecular indicator of dosage sen- 
sitivity. Nevertheless, a control becomes essential to vali- 
date the conclusions of this study. As it turns out, this is 
the same control that serves to validate the molecular 
marker adopted [7] and arises from the following ratio- 
nale: If a specific gene duplication is actually part of a 
macro-scale event of whole genome duplication (WGD), 
we expect little or no selection pressure arising from 
dosage sensitivity since a WGD does not generate a 
dosage imbalance. Hence, the expression divergence 
brought about by subfunctionalization of gene duplicates 
arising from a WGD should result only from random 
genetic drift, with a minor adaptive contribution. This is 
indeed the case, as shown in this work. 

Results and discussion 

Adaptive subfunctionalization 

We identified the adaptive component of subfunctionali- 
zation by determining the extent of paralog segregation 
through dissimilar mRNA expression as a function of 
<v>, normalizing for the divergence time of each family. 
To support the conjecture of adaptive subfunctionaliza- 
tion in human, we generated an exhaustive database 
combining genetic, mRNA-expression and wrapping 
information on human genes and focused on differences 
in partial degradation of regulatory elements for mRNA- 
expression across paralogs, a causative of paralog segre- 
gation. Only 1957 human gene families from Ensembl 
Genome Database NCBI36 [14] and reported expression 
information [15] were found to have coding regions 
with sustainable ordered structure for free (uncom- 
plexed) subunits. Genes with ORFs coding for disor- 
dered regions were excluded from analysis since lack of 
sustainable structure implies that no wrapping assess- 
ment is possible, and structure may only be induced 
upon association. No paralog protein subunits forming 
intra-family complexes were found in the database, 
hence the analysis is free from this confounding factor 
since obligatory complexation would force coexpression 
of subunits. 

We regarded expression dissimilarities across paralogs 
as the means of avoiding competing for binding partners 
upon gene duplication. Three attributes of human gene 
families were considered: <rj>, the mRNA-level expres- 
sion diversification averaged over paralogs, <v>, and Ks, 
the synonymous nucleotide divergence, a proxy for 
divergence time [16]. For the gene families under con- 
sideration we obtain Ks < 2, hence we expect minor 
saturation effects (cf. [8]). Since paralog divergence is 
reflective of divergence time, the selection pressure 
quantified by <v> is normalized to Ks <v>, and the 



buffering effect resulting from subfunctionalization is 
established by plotting Ks <v> versus <rj> (Figure 1A). 

Paralog diversification <rj> was estimated by the Pear- 
son coefficient for gene expression vectors correspond- 
ing to each paralog pair. For two expression vectors X 
and Y, this coefficient is 

n(X Y) - <{X-<X>){Y-<Y>)> 

V< X 2 > - < X> V< Y 2 > - < Y> 2 

where X, Y are generic coordinates of vectors X and Y 
respectively, and < > in the equation indicates mean 
over cell types. 

Expression diversification in human is more pro- 
nounced for genes with high dosage sensitivity in conso- 
nance with the hypothesis that subfunctionalization, 
essentially a nonadaptive process, mitigates dosage 
imbalance effects. The results (Figure 1A) reveal a sig- 
nificant linear correlation (R 2 = 0.43), implying that 
paralog segregation through subfunctionalization into 
non-overlapping mRNA-expression patterns becomes 
enhanced in accord with the dosage sensitivity of the 
gene duplicates (P < 2.2 x 10" 5 ). This segregation is 
needed to avoid dosage imbalances whose effects scale 
with <v>. 

Control analyses were carried out for fly (Drosophila 
melanogaster), worm (Caernohabditis elegans) and yeast 
(Saccharomyces cerevisiae), for which genetic and 
expression data distributed across tissue or developmen- 
tal phases is available and may be combined with disor- 
der-based estimations of <v> (Materials and methods). 
Only 1354, 2137, 1391 non-singleton gene families in 
yeast, worm and fly, respectively, were examined as they 
have been found to have all coding regions with sustain- 
able ordered structure for free subunits (Additional file 
1). The data on these species endowed with higher 
selection efficiency reveals that paralog segregation 
becomes more sensitive and more tightly correlated to 
differences in dosage sensitivity and variations in diver- 
gence time, as attested by the quadratic dependence in 
the Ks <v>-<r|> plot (Figures 1B-D, P < 10" 7 for fly and 
worm, P < 10" 9 for yeast). To contrast the sensitivity of 
paralog segregation of human relative to control species, 
we define the family-associated segregation parameter S 
= 1/2(1- <rj>) (0 < S < 1), and plot its Ks-normalized 
value versus <v> for all four species, grouping families in 
10% <v> -ranges (Figure 2). As expected, paralog-segre- 
gation sensitivity increased in the order human < fly ~ 
worm < yeast, roughly following the species selection 
efficiency associated with population size [12]. 
Figure 2 incorporates a control analysis of paralog segre- 
gation in a scenario where duplicate genes arise from a 
whole-genome duplication (WGD) event in yeast [17]. 
This control is relevant since a WGD does not create a 
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Figure 1 Paralog segregation buffers dosage imbalance effects and hence scales with dosage sensitivity. Paralog segregation within a 
gene family is described by expression correlation parameter <r|>, while dosage sensitivity is indicated by <v>, the average underwrapping of 
gene products in the family. The <r|>-<v> interdependence is normalized by the divergence time of the family, indicated by Ks. Plot of Ks <v> 
versus <r|> for 1957 human families (A), 1391 fly families (B), 2137 worm families (C) and 1354 yeast families (D) with combined genetic, 
expression and structural information (Materials and Methods). The correlation coefficient R 2 was obtained by regression analysis. 
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Figure 2 Paralog segregation sensitivity S as function of dosage sensitivity <v>. The results are normalized by divergence-time parameter 
Ks. Gene families for all four species are grouped by 10% <v> -ranges and the S/Ks values are averaged over each bin for each species. 
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dosage imbalance and hence duplicates arising from a 
WGD are expected to be subject to little or no selection 
pressure arising from dosage sensitivity. If our hypoth- 
esis is correct as the previous analysis suggests, the 
expression divergence of duplicates resulting from 
WGD and brought about by subfunctionalization should 
result only from random genetic drift, with a minor 
adaptive contribution. This implies that S should be 
approximately proportional to the divergence time and 
independent of <v>, or S/Ks should remain approxi- 
mately constant and low relative to the level of segrega- 
tion experienced by duplications that generate dosage 
imbalances. This is indeed the case, as shown in Figure 
2, leading to the conclusion that in the absence of 
dosage-related selection pressure, subfunctionalization is 
indeed the result of random genetic drift (P < 4 x 10" 5 ), 
as postulated by Lynch and co-workers [1]. 

Conclusions 

The preservative role of subfunctionalization in humans 
and other higher eukaryotes is the result of mildly 
degenerative mutations likely to cause a differentiating 
degradation of expression-regulatory elements in gene 
duplicates. As shown in this work, this process, mechan- 
istically nonadaptive, is subject to the forces of selection 
and thus develops an adaptive component. This observa- 
tion motivates the present analysis of the contradictory 
aspects of constructive neutrality. 

Subfunctionalization is nonadaptive insofar as mildly 
deleterious mutations arise and are fixed in the species 
population through the vagaries of random genetic drift, 
and adaptive since subfunctionalization becomes also a 
buffer of the dosage imbalances that arise from gene 
duplication. When compared with species with higher 
selection efficiency, paralog segregation in human is not 
nearly as complete or efficient for families with high 
dosage sensitivity. Yet, the results from Figures 1A and 
2 reveal a significant adaptive role of human subfunctio- 
nalization when regarded as a buffer of the effects of 
dosage imbalance quantified by the gene dosage sensitiv- 
ity. Thus, the statistical analysis presented in this work 
unravels the fact that a process that is mechanistically 
nonadaptive when viewed as enabler of duplicate reten- 
tion may have adaptive consequences since it also serves 
to mitigate the selection pressure arising from duplica- 
tion events. 

This picture is further validated by examining a sce- 
nario in which gene duplications do not generate dosage 
imbalances. Such is the case with whole genome dupli- 
cation (WGD) [17]. In this case, we expect and corrobo- 
rate that the dominant evolutionary force leading to 
paralog segregation through subfunctionalization is ran- 
dom genetic drift. 



The mechanistic effects of population size on the effi- 
cacy of subfunctionalization were emphasized by Lynch 
[1], and are clearly confirmed in our study (Figures 1, 
2). To further test this dependence, it would be desir- 
able to contrast paralog segregation in endosymbionts 
versus the segregation undergone by the orthologs of 
these paralogs in the free species. However, the expres- 
sion of genes in an endosymbiont is highly coordinated 
and correlated with gene expression in the host [18], 
thereby masking the effects of population size on para- 
log segregation. 

Methods 

Gene information was obtained from the following 
sources: Saccharomyces cerevisiae (strain S288C), Sac- 
charomyces Genome Database http://www.yeastgenome. 
org/ (SGD1.01); Caenorhabditis elegans, WormBase 
http://www.wormbase.org/ (WB170); Drosophila mela- 
nogaster, Berkeley Drosophila Genome Project http:// 
www.fruitfly.org/ (BDGP 4.3); Homo sapiens, Ensembl 
Genome Database (NCBI36). Using the Ensembl gene 
family annotation [14], 6,024 yeast genes were grouped 
into 4,661 families, 20,173 worm genes were grouped 
into 11,503 families, 14,116 fly genes were grouped into 
9,477 families, and 22,357 human genes were grouped 
into 12,394 families. 

Gene expression data for different species were 
obtained from different sources: Novartis Gene Expres- 
sion Atlas [15] for human, FlyAtlas for fly [19], 
PUMAdb for worm [20], Saccharomyces Genome Data- 
base for yeast [21]. For human, the gene expression 
dataset contains expression levels across a panel of 73 
normal human tissues (samples of the 6 cancer-related 
tissues were not included). The PUMAdb dataset con- 
tains gene expression levels for worm at 6 different 
developmental time points (egg, LI, L2, L3, L4, and 
young adult) in two different strains (N2 and CB4856). 
The Saccharomyces Genome Database contains yeast 
mRNA expression levels during the 5 metabolic adapta- 
tion phases representing the transition from glucose-fer- 
mentative to glycerol-based respiratory growth. 
Paralogous genes arising from yeast WGD were 
obtained from Kellis et al. [22]. 

Synonymous nucleotide divergence, Ks, across paralog 
pairs was determined using the PAML package [23]. Its 
relevance as a surrogate for divergence time in a gene 
family is clearly delineated in [16]. 

The wrapping of a backbone hydrogen bond, was 
computed directly from PDB structural coordinates for 
gene products whenever available [8,11]. This local para- 
meter is computed by determining the number of side- 
chain nonpolar groups contained within a desolvation 
domain around the bond. This domain was defined as 
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two intersecting spheres of fixed radius (-thickness of 
three water layers) centered at the a -carbons of the resi- 
dues paired by the hydrogen bond. In structures of solu- 
ble proteins, backbone hydrogen bonds are protected on 
average by £ = 26.6 ± 7.5 nonpolar groups for a desolva- 
tion sphere radius 6A. Dehydrons lie in the tails of the 
distribution, i.e. their microenvironment contains 19 or 
fewer nonpolar groups < 19), so their Rvalue is below 
the mean minus one standard deviation. 

The parameter v can be determined from protein 
sequence, an imperative given the scarcity of structural 
information relative to proteome sizes. Since they repre- 
sent structural vulnerabilities, dehydrons belong to a 
twilight zone between order and disorder [12]. This 
characterization is suggested by a strong correlation 
between two local parameters: wrapping (Q, giving the 
number of protective nonpolar groups around the BHB, 
and propensity for structural disorder (f d ) [11,12]. The 
correlation reflects the fact that the propensity for back- 
bone hydration is indicative of a propensity for structure 
disruption. The parameter f d is a sequence-based score 
generated by the program PONDR-VLXT [24], a predic- 
tor of native disorder that takes into account residue 
attributes and their distribution within the window 
interrogated [13]. The disorder score (0 < f d < 1) is 
assigned to each residue within a sliding window, repre- 
senting the predicted propensity of the residue to be in 
a disordered region (f d = 1, certainty of disorder; f d = 0, 
certainty of order). The strong correlation between the 
disorder score of a residue and wrapping of the hydro- 
gen bond engaging the residue (if any) provides a 
sequence-based method of inference of dehydrons and 
supports the picture that such bonds belong to an 
order-disorder twilight zone. Thus, dehydrons can be 
inferred in regions where the disorder score lies in the 
range 0.35 < f d < 0.95, which corresponds to a marginal 
BHB wrapping with 7 < £ < 19. 

Additional material 



Additional file 1: Non-singleton gene families in yeast, worm, fly 
and human with coding regions able to sustain free subunits with 
ordered structure. 
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